2.1 Income
2.3 Big Five Personality Traits
2.4 Gender
2.5 Education Level
3.1. Correlation matrix
3.2. Covariance Matrix
3.3. Data Generation
3.5. Classification of Education Level
3.7. Details on dataset
3.8. Visualisation
The purpose of this project is to simulate a dataset representing income in the United States and the various factors associated with it. In this instance, the factors are cognitive ability and personality (the Big Five traits: Openness, Conscientiousness, Extraversion, Agreeableness and Neuroticism). In addition, gender and education level will be added to the dataset.
The distribution of income and the factors that correlate with it are of interest because of questions concerning income inequality and how much of success is due to so-called "fair" factors such as personality and cognitive ability versus "unfair" factors such as parental wealth, social status and gender privilege.
Each of the factors and their distributions are described as follows:
The factor of primary concern in this simulation is total lifetime income in the United States. The aim of the project is to simulate a distribution of total lifetime income given the mean and standard deviation reported in the paper "Who Does Well in Life? Conscientious Adults Excel in Both Objective and Subjective Success" 1. Total lifetime income is defined as the cumulative income earned by an individual of mean age 68 years (range 30-91 years), so on average lifetime income is the cumulative income earned over the 50 years from age 18 to 68.
Generally, income (including lifetime income) is best described using the lognormal distribution, as the mode tends to be less than the median, which is less than the mean (reflecting income inequality). However, because earnings are capped at a taxable maximum in the paper, the data are not sufficiently skewed to justify the transformation and can be modelled with a normal distribution.
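To illustrate the distinction, a right-skewed lognormal sample has its mean pulled above its median, while a normal sample is symmetric. A minimal sketch with illustrative parameters (not the paper's data):

```python
import numpy as np

rng = np.random.default_rng(0)

# symmetric normal sample: mean and median nearly coincide
normal = rng.normal(loc=1.0, scale=0.5, size=100_000)
# right-skewed lognormal sample: mean pulled above the median
lognormal = rng.lognormal(mean=0.0, sigma=0.75, size=100_000)

print(np.mean(normal) - np.median(normal))        # near zero
print(np.mean(lognormal) - np.median(lognormal))  # clearly positive
```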
The mean and standard deviation of total lifetime income are 980,000 US dollars and 738,000 US dollars respectively.
m_i = 980000
stdev_i = 738000
NOTE: Negative income values are not omitted. This is because it is possible to have a net negative lifetime income if one is heavily in debt at the age of 68.
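Under a normal model with these parameters, the expected share of negative draws can be read directly off the CDF; a sketch using scipy.stats.norm suggests roughly 9% of simulated incomes will be negative:

```python
from scipy.stats import norm

m_i = 980000       # mean lifetime income (USD)
stdev_i = 738000   # standard deviation (USD)

# P(income < 0) under the fitted normal distribution
p_negative = norm.cdf(0, loc=m_i, scale=stdev_i)
print("Share of negative incomes: {:.1%}".format(p_negative))
```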
Cognitive ability as measured by the paper includes memory, vocabulary and numeracy. These measures (in particular vocabulary) correlate very highly with IQ 2. IQ is measured initially on an ordinal scale with percentiles but is approximated on the interval scale as a normal distribution, or bell curve, across a whole population 3. Cognitive ability correlates positively with income.
In the paper, the mean and standard deviation are 0 and 1. It is normally distributed like IQ.
m_ca = 0
stdev_ca = 1
The Big Five Personality traits are an attempt by psychologists to encapsulate and quantify several personality traits 4. These traits are similar to IQ in that they are measured on the ordinal scale by psychologists but are approximated on the interval scale as a normal distribution across a whole population 5. The traits are measured on a scale where 4 means an individual scores extremely high in a trait and 1 means they score extremely low in a trait.
The Big Five traits are Openness, Conscientiousness, Extraversion, Agreeableness and Emotional Stability/Neuroticism (the facets and domains of the Big Five are not considered here). The mean and standard deviations are drawn from the paper.
Openness measures the level of interest in art, intellectual pursuits and creativity in an individual. It also measures how unconventional and fantasy-prone they are. It correlates positively with income. The mean is 1.95 and the standard deviation is 0.55.
Conscientiousness measures how industrious, organised and self-disciplined an individual is. It also measures how cautious and dutiful they are. It correlates positively with income. The mean is 2.56 and the standard deviation is 0.48.
Extraversion measures how talkative, assertive and sensation-seeking an individual is. It also measures their level of activity and positive emotions. It correlates negatively with income. The mean is 2.2 and the standard deviation is 0.55.
Agreeableness measures how modest, altruistic and honest an individual is. It also measures how compassionate and trusting they are. It correlates negatively with income. The mean is 2.53 and the standard deviation is 0.47.
Emotional stability is the reverse of Neuroticism. Neuroticism measures how anxious, fearful and depressed an individual is. It also measures how self-conscious and impulsive they are. It correlates negatively with Income, whereas its reverse Emotional Stability correlates positively with Income. The mean (of Emotional Stability) is 2.71 and the standard deviation is 0.61.
# means of big five traits
m_o = 1.95
m_c = 2.56
m_e = 2.2
m_a = 2.53
m_es = 2.71
# standard deviations of big five traits
stdev_o = 0.55
stdev_c = 0.48
stdev_e = 0.55
stdev_a = 0.47
stdev_es = 0.61
Gender, as defined in this project, is whether the individual identified as male or female. Since this project's purpose is to generate a dataset which correlates with total lifetime income for people who are on average 68 years old, the expectation is that men's lifetime income will dwarf women's lifetime income because for much of the 20th century men were the sole breadwinners of households, in the United States and in other industrialised nations 6.
In this project, gender is represented as a categorical variable which can be either 'M' or 'F'. It is generated by taking the income brackets, computing the probability that a given individual in a bracket is male, drawing a 1 or 0 from the binomial distribution with that probability, and setting the result to 'M' or 'F' respectively before incorporating it into the dataset.
Education level is defined in this project as the highest level of education attained by an individual in their lifetime. There are five levels of American education described here, from lowest to highest: less than high school (LTHS), high school graduate (HSG), some college (SC), bachelor’s degree only (BA), and graduate degree attainment (GRAD). These were taken from the paper "Education and Lifetime Earnings in the United States" 7. Higher levels of education are associated with higher incomes 8.
In the paper, bar charts of education level and gross lifetime earnings are presented for men and women. The values for earnings and education level are drawn from these bar charts. Education level is a categorical variable in the paper, but for the sake of simplicity it is converted to a numerical variable where 1 is the lowest level (LTHS) and 5 is the highest (GRAD). The correlation between gross lifetime earnings and education is then calculated, and the dataset representing the level of education is generated using that correlation.
Afterwards, when the values for Education level are appended to the dataset, the data is transformed back into a categorical variable.
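The numeric-to-categorical round trip can be sketched with a pair of plain mappings (hypothetical helper dictionaries, not the project's own code):

```python
# ordered education levels, lowest to highest
levels = ['LTHS', 'HSG', 'SC', 'BA', 'GRAD']

# categorical -> numeric codes 1..5
to_code = {lvl: i + 1 for i, lvl in enumerate(levels)}
# numeric -> categorical (the inverse mapping)
to_level = {code: lvl for lvl, code in to_code.items()}

print(to_code['BA'])   # 4
print(to_level[5])     # GRAD
```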
To generate the correlation between Income and Education, data from the second paper referenced is used in conjunction with the module scipy.stats.stats to generate the correlation coefficient between lifetime earnings (Income) and highest level of education achieved (Education).
For correlations between educational level and the Big Five personality traits, the correlations between GPA (Grade Point Average) and the Big Five taken from "Personality Predictors of Academic Outcomes: Big Five Correlates of GPA and SAT Scores" 9 are used (this is a simplification as GPA or grade point average correlates highly with educational level but is not identical to educational level).
In addition, the correlation between GPA and IQ (and hence cognitive ability) is taken from "Personality and Intelligence Interact in the Prediction of Academic Achievement" 10.
# education level and lifetime earnings
elae = [(1, 1.13), (1,0.51), (2, 1.54), (2, 0.8), (3, 1.76), (3, 1.01), (4, 2.43), (4, 1.43), (5, 3.05),
(5, 1.86)]
# generate Pearson's correlation coefficient
from scipy.stats import pearsonr
# list comprehensions to get x and y values
# for the pearsonr function
listx = [i[0] for i in elae]
listy = [i[1] for i in elae]
# using pearsonr function to get correlation coefficient
# between education level and income
pie = round(pearsonr(listx,listy)[0], 2)
print("The correlation between Income and Education is: {:.2f}".format(pie))
# the correlations between Big Five and GPA (Education) are:
# Conscientiousness
pce = 0.26
# Openness
poe = 0.05
# Extraversion
pee = -0.04
# Agreeableness
pae = 0.09
# Emotional Stability (reverse of Neuroticism)
pne = 0.07
print("Correlations for Big Five traits C, O, E, A, ES are: {}, {}, {}, {}, {}".format(pce, poe, pee, pae, pne))
# The correlation between Cognitive ability and Education:
pcae = 0.31
print("The correlation between Cognitive Ability and Education is: {}".format(pcae))
The correlation between Income and Education is: 0.78
Correlations for Big Five traits C, O, E, A, ES are: 0.26, 0.05, -0.04, 0.09, 0.07
The correlation between Cognitive Ability and Education is: 0.31
Since all the factors correlate with income (and most the factors also correlate with each other to some degree), the multivariate normal distribution is used to generate the dataset 11. The correlation matrix used in the multivariate normal distribution is entered using the correlations from the first paper as well as the correlation calculated in the previous cell. The rest are taken from the second paper (the pretty_print_matrix function is taken from Stack Overflow 12).
# this function prints out a matrix in
# a presentable fashion
# taken from Stack Overflow
def pretty_print_matrix(matrix):
    s = [[str(e) for e in row] for row in matrix]
    lens = [max(map(len, col)) for col in zip(*s)]
    fmt = '\t'.join('{{:{}}}'.format(x) for x in lens)
    table = [fmt.format(*row) for row in s]
    print('\n'.join(table))
# rows of correlation matrix
# Conscientiousness row
p1= [1, 0.67, 0.61, 0.63, 0.23, 0.27, pce, 0.08]
# Openness row
p2 = [0.67, 1, 0.68, 0.51, 0.2, 0.27, poe, 0.1]
# Extraversion row
p3 = [0.61, 0.68, 1, 0.8, 0.25, 0.01, pee, -0.04]
# Agreeableness row
p4 = [0.63, 0.51, 0.8, 1, 0.07, 0.02, pae, -0.14]
# Emotional Stability row
p5 = [0.23, 0.20, 0.25, 0.07, 1, 0.16, pne, 0.16]
# Cognitive ability row
p6 = [0.27, 0.27, 0.01, 0.02, 0.16, 1, pcae, 0.34]
# Education row
p7 = [pce, poe, pee, pae, pne, pcae, 1, pie]
# Income row
p8 = [0.08, 0.10, -0.04, -0.14, 0.16, 0.34, pie, 1]
# correlation matrix
corr = [p1, p2, p3, p4, p5, p6, p7, p8]
pretty_print_matrix(corr)
1     0.67  0.61  0.63  0.23  0.27  0.26   0.08
0.67  1     0.68  0.51  0.2   0.27  0.05   0.1
0.61  0.68  1     0.8   0.25  0.01  -0.04  -0.04
0.63  0.51  0.8   1     0.07  0.02  0.09   -0.14
0.23  0.2   0.25  0.07  1     0.16  0.07   0.16
0.27  0.27  0.01  0.02  0.16  1     0.31   0.34
0.26  0.05  -0.04 0.09  0.07  0.31  1      0.78
0.08  0.1   -0.04 -0.14 0.16  0.34  0.78   1
The correlation matrix should be symmetric. A quick way to test this is by checking if it is equal to its transpose. This is accomplished using the numpy package and the T (transpose) function in numpy:
# import numpy for the array comparison
import numpy as np
# compare if correlation matrix and its transpose are equal
comparison = np.array(corr) == np.array(corr).T
equal_arrays = comparison.all()
print(equal_arrays)
True
The transpose of the matrix is equal to the matrix, therefore the matrix is symmetric.
The correlation matrix is used to generate a covariance matrix 13.
The covariance matrix is a matrix of numbers which indicate how the variables in the dataset relate to each other. It is calculated using the correlation matrix and an array of standard deviations of the variables in the dataset.
The code to do this is as follows:
# calculate covariance matrix from correlation matrix
from statsmodels.stats.moment_helpers import corr2cov
# import numpy for numpy arrays
import numpy as np
# set precision of numpy
np.set_printoptions(precision=3)
# import pandas as pd
import pandas as pd
# set precision to 3
pd.set_option("display.float_format", lambda x: '%.3f' % x)
# get mean and standard
# deviation for education level
m_ed = np.mean(listx)
stdev_ed = np.std(listx)
# mean and standard deviation lists
mean_list = [m_c, m_o, m_e, m_a, m_es, m_ca, m_ed, m_i]
std_list = [stdev_c, stdev_o, stdev_e, stdev_a, stdev_es, stdev_ca, stdev_ed, stdev_i]
# using corr2cov to generate covariance matrix
# using the correlation matrix corr
# and std_list (standard deviation list)
cov = corr2cov(corr, std_list)
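corr2cov implements the identity Σ = D·R·D, where R is the correlation matrix and D is a diagonal matrix of standard deviations. A numpy-only sketch of that identity on a small illustrative matrix:

```python
import numpy as np

# small illustrative correlation matrix and standard deviations
R = np.array([[1.0, 0.3],
              [0.3, 1.0]])
stds = np.array([2.0, 5.0])

# covariance via Sigma = D @ R @ D
D = np.diag(stds)
cov = D @ R @ D
print(cov)  # off-diagonal entry is rho * s1 * s2 = 0.3 * 2 * 5 = 3
```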
The dataframe is generated by taking the covariance matrix and list of means and converting them into numpy arrays. They are used in the np.random.multivariate_normal function to generate the dataset. The dataset is 500 data points in size.
The dataset is then used as data for a pandas dataframe called df.
# generate the dataset
dataset = np.random.multivariate_normal(np.array(mean_list), np.array(cov), 500)
# dataframe df of dataset
df = pd.DataFrame(dataset, columns =['Conscientiousness', 'Openness', 'Extraversion', \
'Agreeableness', 'Emotional Stability', \
'Cognitive Ability', 'Education', 'Income']);
To classify each income range, a very simple classification scheme is used: incomes above the 98th percentile are categorised as "H" for High; incomes above the 84th percentile up to and including the 98th as "HM" for High Middle; incomes above the 50th percentile up to and including the 84th as "LM" for Lower Middle; incomes above the 16th percentile up to and including the 50th as "HL" for Higher Lower; incomes above the 2nd percentile up to and including the 16th as "ML" for Middle Lower; and incomes at or below the 2nd percentile as "L" for Low.
This classification scheme works because the following hold true for the percentiles:
98th+ : value > μ + 2σ
84th to 98th : μ + 2σ ≥ value > μ + σ
50th to 84th : μ + σ ≥ value > μ
16th to 50th : μ ≥ value > μ - σ
2nd to 16th : μ - σ ≥ value > μ - 2σ
2nd- : value ≤ μ - 2σ
Where μ is the mean of Income and σ is the standard deviation of Income.
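These identities follow from the standard normal CDF (Φ(1) ≈ 0.841, Φ(2) ≈ 0.977); a quick numerical check using scipy.stats.norm:

```python
from scipy.stats import norm

# percentile of mu + k*sigma under any normal distribution
for k in (-2, -1, 0, 1, 2):
    print("mu {:+d} sigma -> {:.1f}th percentile".format(k, 100 * norm.cdf(k)))
```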
This classification scheme is done as follows:
# Different classes of income
# between(..., inclusive='right') gives half-open
# intervals (a, b], matching the scheme outlined above
df.loc[df['Income'] > m_i + 2*stdev_i, 'Income Class'] = 'H'
df.loc[df['Income'].between(m_i + stdev_i, m_i + 2*stdev_i, inclusive='right'), 'Income Class'] = 'HM'
df.loc[df['Income'].between(m_i, m_i + stdev_i, inclusive='right'), 'Income Class'] = 'LM'
df.loc[df['Income'].between(m_i - stdev_i, m_i, inclusive='right'), 'Income Class'] = 'HL'
df.loc[df['Income'].between(m_i - 2*stdev_i, m_i - stdev_i, inclusive='right'), 'Income Class'] = 'ML'
df.loc[df['Income'] <= m_i - 2*stdev_i, 'Income Class'] = 'L'
The education level was converted to a numerical variable in Section 3.1. In this section, it is converted back into the categorical variable as follows:
df.loc[df['Education'] >= 5, 'Education Level'] = 'GRAD'
# between(..., inclusive='left') gives half-open intervals [a, b)
df.loc[df['Education'].between(4, 5, inclusive='left'), 'Education Level'] = 'BA'
df.loc[df['Education'].between(3, 4, inclusive='left'), 'Education Level'] = 'SC'
df.loc[df['Education'].between(2, 3, inclusive='left'), 'Education Level'] = 'HSG'
df.loc[df['Education'] < 2, 'Education Level'] = 'LTHS'
The "Education" column can now be dropped:
df = df.drop(['Education'], axis=1)
df.head()
| Conscientiousness | Openness | Extraversion | Agreeableness | Emotional Stability | Cognitive Ability | Income | Income Class | Education Level | |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.312 | 1.848 | 2.064 | 2.618 | 3.004 | 0.577 | 1221726.386 | LM | BA |
| 1 | 2.651 | 2.576 | 2.554 | 2.811 | 3.800 | -0.325 | 1541878.153 | LM | SC |
| 2 | 2.612 | 0.976 | 1.806 | 2.742 | 2.763 | -0.119 | 1239460.726 | LM | GRAD |
| 3 | 3.334 | 2.686 | 2.323 | 2.180 | 2.639 | 1.332 | 1661026.815 | LM | SC |
| 4 | 3.054 | 1.545 | 2.873 | 2.802 | 4.184 | 1.109 | 1676906.046 | LM | GRAD |
The first paper referenced previously controls for gender, so gender cannot be classified using data from that paper. However, the means and standard deviations of lifetime incomes (defined here as cumulative income after 50 years) for men and women are available in the second paper referenced, "Education and Lifetime Earnings in the United States". The probability that an individual is a man, given that they are in a certain income bracket, can be calculated using these values.
This is done by taking the lifetime income values for men and women; computing the means and standard deviations for men, for women, and for the total (men and women together); generating normal distributions from these values; and getting the number of men who fall in each bracket (each bracket being delineated by its percentile range in the total distribution).
This is divided by the number of men and women in that bracket to get the probability of being a man, given that you're in that bracket.
This is done using the following code:
# the lifetime income of men and women by education bracket (in millions)
men = [3.05, 2.43, 1.76, 1.54, 1.13]
women = [1.86, 1.43, 1.01, 0.8, 0.51]
# total
total = men + women
# distributions for men and women
men_dist = np.random.normal(np.mean(men), np.std(men), 1000000)
women_dist = np.random.normal(np.mean(women), np.std(women), 1000000)
# Higher income bracket
prop_men_H = np.sum(men_dist> (np.mean(total) + 2*np.std(total)))
prop_women_H = np.sum(women_dist> (np.mean(total) + 2*np.std(total)))
# proportion of men in this bracket
men_H = (prop_men_H)/(prop_men_H + prop_women_H)
print("Probability of being a man given you're in the highest income bracket {:.2f}%".format(100*men_H))
# Higher middle income bracket
a = (np.mean(total) + 2*np.std(total))
b = (np.mean(total) +np.std(total) + 0.00001)
# count (not sum) the draws that fall in this bracket
prop_men_HM = np.sum((men_dist <= a) & (men_dist >= b))
prop_women_HM = np.sum((women_dist <= a) & (women_dist >= b))
# proportion of men in this bracket
men_HM = (prop_men_HM)/(prop_men_HM + prop_women_HM)
print("Probability of being a man given you're in the high middle income bracket: {:.2f}%".format(100*men_HM))
# Lower middle income bracket
a = (np.mean(total) + np.std(total))
b = (np.mean(total) + 0.00001)
# count (not sum) the draws that fall in this bracket
prop_men_LM = np.sum((men_dist <= a) & (men_dist >= b))
prop_women_LM = np.sum((women_dist <= a) & (women_dist >= b))
# proportion of men in this bracket
men_LM = (prop_men_LM)/(prop_men_LM + prop_women_LM)
print("Probability of being a man given you're in the lower middle income bracket: {:.2f}%".format(100*men_LM))
# Higher lower income bracket
a = (np.mean(total))
b = (np.mean(total) -np.std(total) + 0.00001)
# count (not sum) the draws that fall in this bracket
prop_men_HL = np.sum((men_dist <= a) & (men_dist >= b))
prop_women_HL = np.sum((women_dist <= a) & (women_dist >= b))
# proportion of men in this bracket
men_HL = (prop_men_HL)/(prop_men_HL + prop_women_HL)
print("Probability of being a man given you're in the higher lower income bracket: {:.2f}%".format(100*men_HL))
# Middle lower income bracket
a = (np.mean(total)-np.std(total))
b = (np.mean(total) -2*np.std(total) + 0.00001)
# count (not sum) the draws that fall in this bracket
prop_men_ML = np.sum((men_dist <= a) & (men_dist >= b))
prop_women_ML = np.sum((women_dist <= a) & (women_dist >= b))
# proportion of men in this bracket
men_ML = (prop_men_ML)/(prop_men_ML + prop_women_ML)
print("Probability of being a man given you're in the middle lower income bracket: {:.2f}%".format(100*men_ML))
# Lower income bracket
prop_men_L = np.sum(men_dist < (np.mean(total) - 2*np.std(total)))
prop_women_L = np.sum(women_dist < (np.mean(total) - 2*np.std(total)))
# proportion of men in this bracket
men_L = (prop_men_L)/(prop_men_L + prop_women_L)
print("Probability of being a man given you're in the lower income bracket: {:.2f}%".format(100*men_L))
Probability of being a man given you're in the highest income bracket 99.94%
Probability of being a man given you're in the high middle income bracket: 97.43%
Probability of being a man given you're in the lower middle income bracket: 71.35%
Probability of being a man given you're in the higher lower income bracket: 29.83%
Probability of being a man given you're in the middle lower income bracket: 14.66%
Probability of being a man given you're in the lower income bracket: 15.46%
These values are used to generate random variables from the binomial distribution with one trial for each income bracket (technically the Bernoulli distribution 14). These variables are appended to a list of genders; each is either 1 (Male) or 0 (Female).
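A single-trial binomial draw is exactly a Bernoulli draw, so the long-run fraction of 1's matches the supplied probability; a sketch with an illustrative p:

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.7  # illustrative probability of drawing a 1 ('M')

# one Bernoulli trial per row, as used for the Gender column
draws = rng.binomial(n=1, p=p, size=100_000)
print("Fraction of 1's: {:.3f}".format(draws.mean()))  # close to 0.7
```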
The function to generate this list is as follows:
# income class list, list of
# income classes for each row
# in the dataframe
income_class = list(df['Income Class'])
# probability of being male for each income class
male_prob = {'H': men_H, 'HM': men_HM, 'LM': men_LM,
             'HL': men_HL, 'ML': men_ML, 'L': men_L}
def assign_prob():
    # list of genders, which will be placed
    # in the dataframe column marked Gender
    column_gender = []
    # for each entry in income class, use the binomial
    # function to generate a 1 or 0 with the probability
    # of being male for that income class
    for i in income_class:
        MOF = np.random.binomial(1, male_prob[i], 1)[0]
        column_gender.append(MOF)
    # list of entries, 1 if male, 0 if female
    return column_gender
column_list = assign_prob()
print("The proportion of men assigned to this sample is: {}".format(np.sum(column_list)/len(column_list)))
The proportion of men assigned to this sample is: 0.518
The proportion of men (i.e. the number of 1's) generated by this method is 0.518 or 51.8%, which is approximately what one would expect in the general population (50% men, 50% women).
This list is assigned to a column called 'Gender' in the dataframe df:
df['Gender'] = column_list
The dataframe column Gender is made into a categorical variable, with 1 mapped to 'M' (male) and 0 mapped to 'F' (female), as follows:
df.loc[df['Gender'] == 1, 'Gender'] = 'M'
df.loc[df['Gender'] == 0, 'Gender'] = 'F'
The means and standard deviations of the columns of the dataframe df are given as follows:
print("The means are:\n{}".format(df.mean()))
print("\nThe standard deviations are:\n{}".format(df.std()))
The means are:
Conscientiousness          2.569
Openness                   1.975
Extraversion               2.203
Agreeableness              2.504
Emotional Stability        2.759
Cognitive Ability          0.052
Income               1002396.979
dtype: float64

The standard deviations are:
Conscientiousness         0.478
Openness                  0.530
Extraversion              0.554
Agreeableness             0.480
Emotional Stability       0.584
Cognitive Ability         1.018
Income               710227.324
dtype: float64
The first 5 elements are given as follows:
df.head()
| Conscientiousness | Openness | Extraversion | Agreeableness | Emotional Stability | Cognitive Ability | Income | Income Class | Education Level | Gender | |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3.312 | 1.848 | 2.064 | 2.618 | 3.004 | 0.577 | 1221726.386 | LM | BA | M |
| 1 | 2.651 | 2.576 | 2.554 | 2.811 | 3.800 | -0.325 | 1541878.153 | LM | SC | F |
| 2 | 2.612 | 0.976 | 1.806 | 2.742 | 2.763 | -0.119 | 1239460.726 | LM | GRAD | F |
| 3 | 3.334 | 2.686 | 2.323 | 2.180 | 2.639 | 1.332 | 1661026.815 | LM | SC | M |
| 4 | 3.054 | 1.545 | 2.873 | 2.802 | 4.184 | 1.109 | 1676906.046 | LM | GRAD | F |
The correlation matrix for this dataset is:
df.corr()
| Conscientiousness | Openness | Extraversion | Agreeableness | Emotional Stability | Cognitive Ability | Income | |
|---|---|---|---|---|---|---|---|
| Conscientiousness | 1.000 | 0.665 | 0.624 | 0.631 | 0.257 | 0.263 | 0.117 |
| Openness | 0.665 | 1.000 | 0.655 | 0.516 | 0.186 | 0.277 | 0.097 |
| Extraversion | 0.624 | 0.655 | 1.000 | 0.798 | 0.277 | 0.001 | -0.040 |
| Agreeableness | 0.631 | 0.516 | 0.798 | 1.000 | 0.103 | 0.038 | -0.125 |
| Emotional Stability | 0.257 | 0.186 | 0.277 | 0.103 | 1.000 | 0.141 | 0.114 |
| Cognitive Ability | 0.263 | 0.277 | 0.001 | 0.038 | 0.141 | 1.000 | 0.352 |
| Income | 0.117 | 0.097 | -0.040 | -0.125 | 0.114 | 0.352 | 1.000 |
A pairplot of the dataset, coloured by Income Class, is given as follows:
# seaborn for pairplot
import seaborn as sns
# sns.set() to make it look nice
sns.set()
sns.pairplot(df, hue='Income Class', markers=['o', '.', 'X', '^', '<', '>']);
A plot divided using Gender is given as follows:
sns.pairplot(df, hue='Gender', markers=['<', '>']);
A plot divided using Education Level is given as follows:
sns.pairplot(df, hue='Education Level', markers=['o', '.', 'X', '<', '>']);
The dataframe can be output to a csv file called dataset.csv using the to_csv function from the pandas package:
# make index false to omit row numbers, make floating point
# numbers max 2 decimal places
df.to_csv("dataset.csv", index=False, float_format='%.2f')
[1] Duckworth, A., Weir, D., Tsukayama, E. and Kwok, D., 2012. Who Does Well in Life? Conscientious Adults Excel in Both Objective and Subjective Success. Frontiers in Psychology, 3.
[2] Doi.apa.org. 2020. APA Psycnet. [online] Available at:
https://doi.apa.org/doiLanding?doi=10.1037%2F0003-066X.51.2.77 [Accessed 26 November 2020].
[3] Psychology.emory.edu. 2020. Interval. [online] Available at: http://www.psychology.emory.edu/clinical/bliwise/Tutorials/SOM/smmod/scalemea/print2.htm [Accessed 26 November 2020].
[4] Psychology Today. 2020. Big 5 Personality Traits. [online] Available at: https://www.psychologytoday.com/ie/basics/big-5-personality-traits [Accessed 26 November 2020].
[5] Reflectd. 2020. A Look Into Personality And The Big Five Personality Traits. [online] Available at: https://reflectd.co/2013/03/22/what-is-personality-does-it-change/ [Accessed 26 November 2020].
[6] Ortiz-Ospina, E., Tzvetkova, S. and Roser, M., 2020. Women's Employment. [online] Our World in Data. Available at: https://ourworldindata.org/female-labor-supply [Accessed 11 December 2020].
[7] Tamborini, C., Kim, C. and Sakamoto, A., 2015. Education and Lifetime Earnings in the United States. Demography, 52(4), pp.1383-1407.
[8] Research.stlouisfed.org. 2020. Education Income And Wealth | St. Louis Fed. [online] Available at: https://research.stlouisfed.org/publications/page1-econ/2017/01/03/education-income-and-wealth/ [Accessed 18 December 2020].
[9] Noftle, E. and Robins, R., 2007. Personality predictors of academic outcomes: Big five correlates of GPA and SAT scores. Journal of Personality and Social Psychology, 93(1), pp.116-130.
[10] Bergold, S. and Steinmayr, R., 2018. Personality and Intelligence Interact in the Prediction of Academic Achievement. Journal of Intelligence, 6(2), p.27.
[11] Numpy.org. 2020. numpy.random.multivariate_normal — NumPy v1.19 Manual. [online] Available at: https://numpy.org/doc/stable/reference/random/generated/numpy.random.multivariate_normal.html [Accessed 26 November 2020].
[12] Stack Overflow. 2020. Pretty Print 2D Python List. [online] Available at: https://stackoverflow.com/questions/13214809/pretty-print-2d-python-list [Accessed 26 November 2020].
[13] Medium. 2020. Let Us Understand The Correlation Matrix And Covariance Matrix. [online] Available at: https://towardsdatascience.com/let-us-understand-the-correlation-matrix-and-covariance-matrix-d42e6b643c22 [Accessed 16 December 2020].
[14] Unf.edu. 2020. [online] Available at: https://www.unf.edu/~cwinton/html/cop4300/s09/class.notes/DiscreteDist.pdf [Accessed 11 December 2020].